Chapter 5 Results
In this chapter, we present our analysis of the questions of interest raised in the introduction and gain preliminary insights through exploratory data analysis and visualization. The chapter is divided into subsections, each of which addresses these questions through a variety of visualizations.
Spatial Data Analysis
Demand and Price Analysis
User Review (Textual Data) Mining
Other Interesting Insights
5.1 Spatial Data Analysis
In this part, we explore some basic variables in our dataset using spatial visualizations and answer questions about the density of restaurants, whether they are open, and cuisine. The detailed analysis is based on the Arizona subset of the data.
5.1.1 Whole Dataset
Before presenting our findings, let’s look at the whole dataset.
library(dplyr)
library(leaflet)
# Average coordinates and number of restaurants per state
yelp_grouped <- yelp_restaurant_data %>%
  group_by(state) %>%
  summarise(long = mean(longitude), lat = mean(latitude), n = n())
# One circle per state, with radius proportional to the number of restaurants
leaflet(data = yelp_grouped) %>%
  setView(lat = 37, lng = -97, zoom = 4) %>%
  addTiles() %>%
  addCircleMarkers(~long, ~lat, label = ~state, radius = ~n/600, stroke = FALSE, fillOpacity = 0.5,
                   popup = paste("State: ", yelp_grouped$state))
As we can see, Yelp only provides data for several cities: Montreal and Waterloo in Canada, and Pittsburgh, Charlotte, Urbana-Champaign, Phoenix, Las Vegas, Madison, and Cleveland in the U.S. We therefore focus on a single area for our analysis and choose Arizona, which includes Phoenix, the city with the largest share of the data.
5.1.2 Arizona Dataset
library(rgdal)
library(dplyr)
library(leaflet)
# Group the data by zip code:
yelp_AZ_postal <- yelp_AZ %>% group_by(postal_code) %>% summarise(n = n())
# Drop the first two rows (presumably blank or malformed postal codes):
yelp_AZ_postal <- yelp_AZ_postal[c(-1, -2), ]
names(yelp_AZ_postal) <- c('GEOID10', 'n')
# Read the shapefile:
my_zip <- readOGR(
  dsn = path.expand("Data/zip_shape"),
  layer = "cb_2018_us_zcta510_500k",
  verbose = FALSE
)
# Attach the restaurant counts to the zip code polygons:
my_zip@data <- left_join(my_zip@data, yelp_AZ_postal, by = 'GEOID10')
# Create a color palette for the map:
mypalette <- colorNumeric(palette = "viridis", domain = my_zip@data$n, na.color = "transparent")
mypalette(c(45, 43))  # quick check: colors assigned to two sample counts
# Choropleth map:
m <- leaflet(my_zip) %>%
  addTiles() %>%
  setView(lat = 33.6, lng = -112, zoom = 9) %>%
  addPolygons(fillColor = ~mypalette(n), stroke = FALSE, fillOpacity = 0.5) %>%
  addLegend("topright", pal = mypalette, values = my_zip@data$n, title = "quantity")
m
Choropleth: Density of restaurants in Arizona.
This plot shows all the restaurants in Arizona grouped by zip code. In the interactive maps that follow, you can click on each restaurant to see details such as its name, address, stars, and whether it is open. This visualization gives a basic understanding of how restaurants are distributed across zip code zones. We can see from the map that most restaurants are clustered around Tempe and Scottsdale, followed by some areas around Phoenix, while the areas with fewer restaurants are somewhat farther from the city of Phoenix.
Now, because the epidemic is very serious, we want to know whether restaurants are still open. Here, we work with the is_open variable in the dataset to see whether there is any pattern in which restaurants are open. In this interactive plot, you can click on the circles to see their basic information.
# Two-color palette for the open/closed flag
AZ_palette <- colorFactor(palette = 'RdYlBu', domain = yelp_AZ$is_open, levels = NULL, ordered = FALSE,
                          na.color = "#808080", alpha = FALSE, reverse = FALSE)
# One marker per restaurant, colored by is_open, with details in the popup
leaflet(data = yelp_AZ) %>%
  addTiles() %>%
  addCircleMarkers(~longitude, ~latitude, label = ~name, radius = 5, stroke = FALSE, fillColor = ~AZ_palette(is_open),
                   popup = paste("Name: ", yelp_AZ$name,
                                 "<br /> Address:", yelp_AZ$address,
                                 '<br /> Stars: ', yelp_AZ$stars,
                                 '<br /> Open: ', yelp_AZ$is_open)) %>%
  addLegend("topright", pal = AZ_palette, values = yelp_AZ$is_open, title = "Is open")
From the plot, it is easy to find differences between open and closed restaurants. For example, there are more open restaurants than closed ones, and they are more widely distributed. Closed restaurants are mostly concentrated in city or town centers. One possible explanation is that these areas carry a higher risk of infection, so owners may suspend business to protect themselves. In addition, because more restaurants are located there, we would expect to see more closures in these areas than elsewhere, even if the closure rate were the same.
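One way to check this reading is to compare closure rates rather than raw counts, since an area with more restaurants will show more closed markers even when the rate is the same. The following is a minimal sketch on an invented toy data frame; the `city` column and all values are assumptions for illustration, and only the `is_open` name is taken from the dataset described above.

```r
# Toy stand-in for yelp_AZ (values invented; `city` is an assumed column).
toy <- data.frame(
  city    = c("Phoenix", "Phoenix", "Phoenix", "Phoenix", "Tempe", "Tempe", "Mesa", "Mesa"),
  is_open = c(1, 0, 0, 1, 1, 0, 1, 1),
  stringsAsFactors = FALSE
)

# Number of restaurants and share of closed restaurants per city.
n_by_city      <- tapply(toy$is_open, toy$city, length)
closed_by_city <- tapply(toy$is_open == 0, toy$city, mean)

closed_share <- data.frame(
  city         = names(n_by_city),
  n            = as.vector(n_by_city),
  closed_share = as.vector(closed_by_city)
)

# Sort so that the cities with the highest closure rate come first.
closed_share <- closed_share[order(-closed_share$closed_share), ]
print(closed_share)
```

If the closed share is similar across areas, the concentration of closed markers in city centers mostly reflects where restaurants are located rather than a higher closure rate there.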
It is interesting to observe these patterns, and if you are living in Phoenix, you may have some deeper and personal findings from this graph.
We are also interested in how ratings relate to geography. People can rate restaurants on Yelp, and anyone looking for somewhere to eat is likely to notice that score, so the stars influence our decisions to some degree.
# Continuous palette for the star rating
star_palette <- colorNumeric(palette = "plasma", domain = yelp_AZ$stars, na.color = "transparent")
# One marker per restaurant, colored by its star rating
leaflet(data = yelp_AZ) %>%
  addTiles() %>%
  addCircleMarkers(~longitude, ~latitude, label = ~name, radius = 5, stroke = FALSE, fillColor = ~star_palette(stars),
                   popup = paste("Name: ", yelp_AZ$name,
                                 "<br /> Address:", yelp_AZ$address,
                                 '<br /> Stars: ', yelp_AZ$stars,
                                 '<br /> Open: ', yelp_AZ$is_open)) %>%
  addLegend("topright", pal = star_palette, values = yelp_AZ$stars, title = "Stars")
The graph shows that restaurants in the centre of Phoenix, a small area in Tempe, Scottsdale, and the upper-right area of the map have more stars. Glendale and, unexpectedly, the area between Phoenix and Tempe have the fewest stars. Interestingly, if we zoom in on the map, we find that the restaurants with very few stars are all in an airport, Phoenix Sky Harbor International Airport. (And now it seems to make sense :)
Having seen the overall distribution, we may ask: is the star distribution related to cuisine? We choose the three most common cuisines and take a look.
library(stringr)
# Keep the restaurants whose cuisine label contains "american"
bool_american <- str_detect(yelp_AZ$country, 'american')
yelp_AZ_american <- yelp_AZ[bool_american, ]
# Drop the all-NA rows produced when the cuisine label is missing
yelp_AZ_american <- subset(yelp_AZ_american, !is.na(yelp_AZ_american$business_id))
leaflet(data = yelp_AZ_american) %>%
  addTiles() %>%
  addCircleMarkers(~longitude, ~latitude, label = ~name, radius = 5, stroke = FALSE, fillColor = ~star_palette(stars),
                   popup = paste("Name: ", yelp_AZ_american$name,
                                 "<br /> Address:", yelp_AZ_american$address,
                                 '<br /> Stars: ', yelp_AZ_american$stars,
                                 '<br /> Open: ', yelp_AZ_american$is_open)) %>%
  addLegend("topright", pal = star_palette, values = yelp_AZ_american$stars, title = "Stars")
We found that different cuisines have similar score distributions, so the stars may be more related to location than to cuisine. Let us look at the overall distribution of cuisines in AZ. Here we choose the nine most common cuisines.
This plot shows clearly that there are some clusters of Mexican restaurants, American restaurants are located everywhere, Chinese restaurants tend toward the lower right, and Indian restaurants are mostly located at the top and lower right.
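The most common cuisines can be picked by counting how many restaurants carry each label. The sketch below assumes, as the `str_detect()` call on `country` above suggests, that `country` holds a free-text string of cuisine labels; the toy rows and the candidate list `cuisines` are invented for illustration.

```r
# Toy stand-in for yelp_AZ (rows invented for illustration).
toy <- data.frame(
  name    = c("A", "B", "C", "D", "E"),
  country = c("american", "mexican", "american, mexican", "chinese", "american"),
  stringsAsFactors = FALSE
)

# Hypothetical list of cuisine labels to look for.
cuisines <- c("american", "mexican", "chinese", "indian")

# Count the restaurants whose label string mentions each cuisine.
counts <- sapply(cuisines, function(cu) sum(grepl(cu, toy$country, fixed = TRUE)))

# Rank cuisines from most to least common and keep the top three.
top <- sort(counts, decreasing = TRUE)
print(head(top, 3))
```

A restaurant tagged with several cuisines is counted once for each of them, so the counts can add up to more than the number of restaurants.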
5.2 Time Analysis
The following plot shows how opening hours differ across the days of the week.
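The chapter does not show how opening hours were computed, so the following is only a sketch under stated assumptions: that each restaurant has one opening-hours string per weekday in the form "H:M-H:M" (for example "11:0-22:0"), and that a closing time earlier than the opening time means the restaurant closes after midnight. The helper `open_hours()` and the toy data are hypothetical.

```r
# Convert strings like "11:0-22:0" into the number of hours open.
# Returns NA for missing or malformed entries.
open_hours <- function(x) {
  parts <- strsplit(x, "-", fixed = TRUE)
  vapply(parts, function(p) {
    if (length(p) != 2 || anyNA(p)) return(NA_real_)
    hm <- strsplit(p, ":", fixed = TRUE)
    start <- as.numeric(hm[[1]][1]) + as.numeric(hm[[1]][2]) / 60
    end   <- as.numeric(hm[[2]][1]) + as.numeric(hm[[2]][2]) / 60
    if (end <= start) end <- end + 24  # assumed: closes after midnight
    end - start
  }, numeric(1))
}

# Toy data: one row per restaurant and weekday (values invented).
toy <- data.frame(
  day   = c("Monday", "Monday", "Saturday", "Saturday"),
  hours = c("11:0-22:0", "9:0-17:0", "11:0-23:0", "17:30-2:0"),
  stringsAsFactors = FALSE
)
toy$duration <- open_hours(toy$hours)

# Average number of hours open per weekday.
avg_by_day <- tapply(toy$duration, toy$day, mean)
print(avg_by_day)
```

The per-day averages can then be drawn as a bar chart or boxplot to compare weekdays.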

5.3 Yelp Rating Analysis
5.3.0.1 Overview of the Geographical Distribution of Star Ratings
In this section, we focus on analyzing the distribution and characteristics of star ratings in the Yelp data. The aim is to answer several questions: Does the rating distribution have geographical features? What is the distribution of ratings across restaurant types? What is the distribution of ratings across country styles? And what is the relationship between a restaurant’s star rating and its price level?
First, we look into the geographical distribution of Yelp star ratings to study the relationship between state and restaurants’ star ratings.

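library(micromapST)
# d (per-state summary) and panelDesc01 (panel layout) are prepared beforehand
micromapST(d, panelDesc01, sortVar = 13, ascend = FALSE,
           title = c("States Average Stars Rating in 2020",
                     "Map distribution & Cleveland dot plot")
)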
In this section, we use the micromapST package to analyze the Yelp star rating dataset. This package gives a clear and straightforward view of where each state is located and how its location relates to the attribute of interest. Because the data are limited, we have to account for states with very few observations, which can dramatically distort the relationship and the results. Accordingly, we choose star rating, state, and review count as data inputs.
In each of the above figures, we can see three columns: the left one is a map of the U.S. in which the colored states are those shown in that panel, the second is a Cleveland dot plot of the average star rating in each state, and the last is the total number of reviews in each state. The first figure is sorted by rating, and the second by review count.
In the first picture, we can see that the average ratings of the states range from 1 to 5 and are mainly concentrated around 3.5, and there appears to be some relation to where the states are located. However, we have to take the number of reviews into consideration, so we switch the sorting from rating to total review count. In the second picture, we can see that for states with a large enough total count (which give a more stable and more convincing picture), the star rating is relatively constant around 3.5 with no obvious geographical features. We therefore conclude that star ratings on Yelp do not show obvious geographical features across states. For a deeper analysis of star ratings, we choose the Arizona subset as a representative, since it ranks in the top two by total review count.
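The effect of small states can also be handled numerically by dropping states below a minimum review count before comparing average ratings. The sketch below uses an invented toy data frame; the column names follow the inputs named above (state, stars, review count), and the threshold `min_reviews` is a hypothetical value chosen only for illustration.

```r
# Toy data: one row per restaurant (values invented).
toy <- data.frame(
  state        = c("AZ", "AZ", "AZ", "NV", "NV", "XX"),
  stars        = c(3.5, 4.0, 3.0, 4.0, 3.0, 5.0),
  review_count = c(100, 50, 50, 200, 100, 3),
  stringsAsFactors = FALSE
)

# Average rating and total number of reviews per state.
mean_stars    <- tapply(toy$stars, toy$state, mean)
total_reviews <- tapply(toy$review_count, toy$state, sum)

summary_by_state <- data.frame(
  state         = names(mean_stars),
  mean_stars    = as.vector(mean_stars),
  total_reviews = as.vector(total_reviews)
)

# Keep only states with enough reviews to give a stable average.
min_reviews <- 100  # hypothetical threshold
reliable <- summary_by_state[summary_by_state$total_reviews >= min_reviews, ]
print(reliable)
```

In this toy example the state with a perfect 5.0 average is backed by only three reviews and drops out, while the remaining states sit at the same average.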
5.3.0.2 Stars Rating Analysis in Arizona

library(ggplot2)
yelp_az <- yelp_clean_data[yelp_clean_data$state == "AZ", ]
ggplot(yelp_az, aes(x = stars)) + geom_bar() +
  geom_vline(xintercept = mean(yelp_az$stars), linetype = "dashed", color = "blue")
In this section, we focus on the Arizona subset to represent the overall ratings, since we concluded that star ratings differ little across states.
First, we give an overview of the star distribution in Arizona. We use a ggplot bar chart to display the distribution of the discrete star rating variable. The distribution is left-skewed and roughly bell-shaped, and the dashed blue line marks the mean rating. In the following parts, we look into the relationship between ratings and restaurant type and country style.
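The direction of the skew can be checked numerically. A left-skewed distribution has a mean below its median and a negative moment-based skewness, $g_1 = m_3 / m_2^{3/2}$, where $m_k$ is the k-th central moment. The sketch below applies this to an invented vector of star ratings, not to the actual Arizona data.

```r
# Toy star ratings in steps of 0.5 (values invented for illustration).
stars <- c(1.5, 2.5, 3.0, 3.5, 3.5, 4.0, 4.0, 4.0, 4.5, 5.0)

m   <- mean(stars)
med <- median(stars)

# Moment-based sample skewness: third central moment over the
# second central moment raised to the power 3/2.
m2   <- mean((stars - m)^2)
m3   <- mean((stars - m)^3)
skew <- m3 / m2^(3/2)

cat("mean:", m, " median:", med, " skewness:", skew, "\n")
```

A negative value means the long tail is on the low-rating side, so most ratings sit at or above the mean.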
5.3.0.2.1 1. The distribution of Ratings among different Restaurant category

library(ggplot2)
library(ggpubr)
ggplot(yelp_az_tidy, aes(stars)) + geom_bar() + facet_wrap(~category)
ggboxplot(yelp_az_tidy, x = "category", y = "stars")
In this section, let’s take a close look at the relationship between the distribution of ratings and restaurant category. We use a ggplot bar chart and ggboxplot as our visualization tools. The restaurant category is the main business of the restaurant, such as alcohol, brunch, drinks, meat, or vegan. The question is whether the rating distribution stays the same or is affected by the main business.
In the first chart, we see the rating distribution as bar charts grouped by restaurant category. The different average heights across panels are driven by the amount of data in each category. Almost every category shows a left-skewed, roughly bell-shaped distribution of ratings: most ratings are centered around 3.5-4, with a longer tail toward lower ratings. This is consistent with the overall distribution.
In the second chart, we see the rating distribution as boxplots grouped by restaurant category. This figure gives a more quantitative view of the differences among categories. It shows that dessert, drinks, and organics often have higher average ratings than the others, especially organics (although organics has relatively little data, which challenges the validity of this result). Meat has almost the same average rating as the others, but its ratings span a wider range and look less normally distributed, which may indicate that people hold more diverse opinions about meat restaurants.
5.3.0.2.2 2. The distribution of Ratings among different Countries


library(stringr)
library(tidyr)
# Indicator column: 1 if the restaurant is tagged as American, 0 otherwise
country_data$american <- ifelse(str_detect(country_data$country, "american"), 1, 0)
country_tidy <- gather(data = country_data, key = "Country", value = "value", 4:16)
# Plot only rows whose indicator is 1 (assumes country_tidy keeps the stars column)
ggplot(subset(country_tidy, value == 1), aes(stars)) + geom_bar() + facet_wrap(~Country)
ggboxplot(subset(country_tidy, value == 1), x = "Country", y = "stars")
There is another perspective we’d like to explore: the relationship between star ratings and country style. We again use a ggplot bar chart and ggboxplot as our visualization tools. The country style is the culinary origin of the restaurant’s food, such as American, Mexican, Chinese, or Japanese (we choose the top 16 countries for analysis). The question is whether the rating distribution stays the same or is affected by country style.
In the first chart, we see the rating distribution as bar charts grouped by country style. The different average heights across panels are driven by the amount of data for each style. Almost every country style shows a left-skewed, roughly bell-shaped distribution of ratings. We can also see that American and Mexican food dominate in the U.S. compared with the others.
In the second chart, we see the rating distribution as boxplots grouped by country style. This figure gives a more quantitative view of the differences among countries. It shows that the star rating distribution is relatively stable across country styles, and that the styles fall into two groups: the first four (American, Mexican, Chinese, Italian) and all the others. The first group has more data, lower average ratings, and a wider rating distribution. The second group often shows higher average ratings and a narrower rating distribution, but also has less data.
5.3.0.2.3 3. Stars Rating VS Price level

library(dplyr)
library(ggplot2)
count_data <- price_data %>% count(price, stars)
ggplot(count_data, aes(price, stars, fill = n)) + geom_tile()
In this section, let’s take a close look at the relationship between star ratings and restaurant price level. Our dataset has a price level and a rating for each restaurant, so we use a geom_tile() heatmap to visualize this relationship. The price level ranges from 1 to 4 ($, $$, $$$, $$$$), and the stars range from 1 to 5 in steps of 0.5.
As shown in the figure, for the low price levels (1 and 2, shown as $ and $$ on the Yelp website), star ratings are closer to normally distributed, with a tendency toward moderate ratings rather than extremely high or low ones. For the high price levels (3 and 4), there is less data and the distribution of ratings is flatter, closer to uniform than normal, which may indicate that a higher price does not necessarily mean a higher rating or a more pleasant experience.
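Because the higher price levels contain far fewer restaurants, raw counts make their tiles look uniformly pale. One option, sketched below on invented toy data, is to convert the counts into shares within each price level so that the shape of the rating distribution can be compared across levels. The column names `price` and `stars` follow the code above; everything else is an assumption for illustration.

```r
# Toy stand-in for price_data (values invented).
toy <- data.frame(
  price = c(1, 1, 1, 2, 2, 2, 2, 3, 4),
  stars = c(3.5, 4, 3.5, 3, 3.5, 4, 3.5, 2, 5)
)

# Cross-tabulate price level against stars.
counts <- table(price = toy$price, stars = toy$stars)

# Share of each star rating within a price level (each row sums to 1).
shares <- prop.table(counts, margin = 1)

# Long format with columns price, stars and Freq, ready for a tile plot.
shares_long <- as.data.frame(shares)
print(round(shares, 2))
```

The long-format `shares_long` could be passed to the same geom_tile() call with `fill = Freq` in place of `fill = n`.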
5.4 Restaurant Services and Attribute Analysis
In this section, we analyze some attributes and services provided by the restaurants to examine whether certain features affect a restaurant’s rating or pricing. In particular, we are curious about how much the supplementary services, beyond the food itself, account for the difference between high-rated and low-rated places, and between expensive and economical ones.
First, to study the relationship between supplementary services and restaurants’ pricing, let’s look at an overview grouped bar chart. The x axis shows the four value-added services that we focus on in the following analysis: Parking (whether there is a convenient parking area near the restaurant), Outdoor Seats (whether the restaurant provides seats for outdoor waiting), WiFi (whether the restaurant provides WiFi for clients), and TV (whether the restaurant is equipped with one or more TVs for entertainment). The y axis shows the counts, and the two colors stand for the “yes” and “no” conditions.
From the graph, we can see that most restaurants are priced at the $$ level, and relatively few are at the $$$ level or higher. As for the services, at all four price levels more restaurants provide parking than not, and the same holds for WiFi. For outdoor seats, however, higher-priced restaurants tend to provide them, while the majority of economical places do not.

Let’s take a closer look at the relationship between pricing and supplementary services. By plotting an alluvial graph, we examine whether higher-priced restaurants provide more value-added services.
The most obvious finding is that a higher proportion of $$$ and $$$$ restaurants offer a convenient parking area or parking services than restaurants at the other two levels. This may be related to where higher-priced restaurants are located: because of their prices, they either have a larger budget to locate somewhere convenient, or have to do so to match their overall standing. As for outdoor seats, roughly half of all restaurants provide them and half do not, with no significant pattern across price levels. This is easy to understand, because not every restaurant requires a long wait before entering. The same result holds for WiFi. This confused us at first, since our intuition and experience suggested that the majority of restaurants provide WiFi nowadays. However, it also makes sense that restaurants vacillate between providing a nice amenity and drawing clients’ attention from the virtual world back to their food. The situation for TVs is quite different: most restaurants have one or more TVs, which may be explained by findings that people tend to eat more while watching videos or shows.

There is another perspective we’d like to explore while analyzing the relationship between price and services. We tend to assume that expensive restaurants impose more “rules”, such as dress codes, and that customers behave more conservatively there. Is this true, or just a stereotype? The alluvial plot focuses on three attributes: Attire (whether the restaurant requires a dress code), AppointmentOnly (whether customers can access the restaurant by appointment only), and Noise (the noise level inside the restaurant).
From the alluvial plot, we can see that although a higher proportion of pricey restaurants set a “dressy” attire requirement, not every $$$ or $$$$ restaurant has a dress code. Most $ and $$ restaurants allow casual attire, while a few of them require dressy attire. As for appointments, very few restaurants are appointment-only, and these are mainly at the $$ and $$$ price levels. Restaurants with exclusive membership or extreme popularity may need appointment-only rules to manage their customer volume and their purchasing of essential ingredients. Noise is a joint effect of the environment and customers’ behavior. Most of the pricey restaurants have average noise, while the economical restaurants are more inclined toward the extremes, either quiet or loud. Providing a relatively quiet environment while making customers feel comfortable chatting and relaxing over their meal is part of the art of running a restaurant.

As we concluded above, restaurants at different price levels differ somewhat in the supplementary services they provide and the rules they set for customers. Now, let’s ask whether these differences matter for restaurant ratings.
Given the distribution, we divided the ratings into five groups. As the alluvial plot shows, the different colored flows go roughly evenly into each category, which suggests that there is no strong relationship between restaurant ratings and value-added services. On the one hand, technological development helps restaurants learn from each other faster and makes it easy for them to upgrade these facilities. When the supplementary services become common, it is hard to stand out by providing the same thing. On the other hand, value-added services include but are not limited to these four factors. Besides the basic amenities, customers may care more about truly distinctive features, such as stylish decoration or meticulous waiters.
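The grouping of ratings into five levels can be sketched with `cut()`. The break points and labels below are hypothetical, since the text does not state which ones were used, and the `wifi` vector is an invented stand-in for one of the service attributes.

```r
# Toy ratings and a toy yes/no service attribute (values invented).
stars <- c(1, 1.5, 2.5, 3, 3.5, 4, 4.5, 5)
wifi  <- c(TRUE, FALSE, TRUE, FALSE, TRUE, FALSE, TRUE, FALSE)

# Hypothetical break points: intervals are open on the left, closed on the right.
rating_group <- cut(
  stars,
  breaks = c(0.5, 2, 3, 3.5, 4, 5),
  labels = c("low", "below average", "average", "good", "excellent")
)

# Share of each rating group among restaurants with and without the service.
flow <- prop.table(table(wifi = wifi, rating_group = rating_group), margin = 1)
print(round(flow, 2))
```

If the two rows of `flow` look alike, the service does little to separate rating groups, which is what evenly split flows in an alluvial plot indicate.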

5.5 Covid Impact Analysis
In this section, we analyze how restaurants were affected by COVID-19.
The Covid dataset from Yelp records how restaurants responded to the pandemic. The main purpose of this section is to explore how the demographic and behavioral data of restaurants correlate with the effects of Covid.
data["Open after covid"] <- (as.logical(data$`delivery or takeout`) | as.logical(data$`Grubhub enabled`)) & data$is_open
Since the original datasets give no information on the opening status of restaurants during the pandemic, I created a new column “Open after covid” that indicates whether a restaurant is still open, inferred from other variables.
In particular, there are three conditions I looked into: whether the restaurant was open before the pandemic, whether the restaurant offers delivery or takeout service, and whether the restaurant collaborates with Grubhub. The code is shown above.
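To make the rule concrete, the sketch below applies the same logic, (delivery or takeout OR Grubhub) AND open before the pandemic, to five invented restaurants. The column names are simplified stand-ins for the ones used above.

```r
# Toy restaurants covering the relevant combinations (values invented).
toy <- data.frame(
  delivery_or_takeout = c(TRUE, FALSE, FALSE, TRUE, FALSE),
  grubhub_enabled     = c(FALSE, TRUE, FALSE, TRUE, FALSE),
  is_open             = c(1, 1, 1, 0, 0)
)

# A restaurant counts as open after Covid only if it was open before
# and offers at least one of the two service channels.
toy$open_after_covid <-
  (toy$delivery_or_takeout | toy$grubhub_enabled) & (toy$is_open == 1)

print(toy)
```

The third row shows the limitation of this proxy: a restaurant that was open but offers neither channel is labelled as not open, even though it may still be serving dine-in customers.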
With the information about the opening status of restaurants, I am interested in how it is distributed across states. Below is a Cleveland dot plot of the opening status of restaurants by state, faceted by delivery availability.
We can clearly see that restaurants that stayed open during Covid are far more likely to offer delivery, while those that closed during Covid are more likely not to.
Moreover, if we look only at the open restaurants panel, the number of non-delivery restaurants is very low (very close to 0), and such restaurants appear only in the states where more data were collected. In the closed restaurants panel, on the other hand, the numbers of restaurants with and without delivery do not differ much in the states where more data were collected.

To analyze the problem further, below is a parallel coordinate plot in which the x-axis is a series of variables related to Covid impacts and restaurant attributes. I colored the lines by opening status.
From the graph, we can see that the relationships between the variables are relatively random: there is no strong evidence of either positive or negative relationships. However, by coloring the data by opening status, we can see that most non-open restaurants have no specific policies in response to Covid and do not provide delivery options. The rating (stars) of a restaurant does not seem to be correlated with opening status, since the non-open restaurants are distributed in similar numbers over each star level. As for the review count, the data are very imbalanced, with a large number of low review counts, so we cannot infer its relationship with the other variables.

Lastly, we are interested in how the impact of Covid differs across states. Below is the statebin plot of open store proportions.
As mentioned above, we have a strong imbalance in our data. Some states have only a few data points, while others have a large number. To alleviate this problem, instead of plotting the frequency of open restaurants in each state, we compute the proportion of open restaurants (the number of open restaurants over the total number of restaurants recorded for that state).
Since not all states are included in our data, we only plotted the states that exist in the dataset. We did not fill in the rest of the states, because marking them with a color or NA could be misleading when comparing states.
From the statebin plot, we see that NY has a relatively high proportion of open restaurants, followed by TX, OH, and NC. There are several states with a proportion of 100%, and these are likely outliers: 100% occurs mostly because only a few restaurants were recorded and these happened to be open.
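The proportion described above, and the small-sample caveat, can be sketched as follows. The toy data and the cut-off `min_n` used to flag small states are invented for illustration; the chapter does not state that such a flag was used.

```r
# Toy data: one row per restaurant (values invented).
toy <- data.frame(
  state = c("AZ", "AZ", "AZ", "AZ", "NV", "NV", "XX"),
  open  = c(TRUE, TRUE, FALSE, TRUE, TRUE, FALSE, TRUE),
  stringsAsFactors = FALSE
)

# Number of restaurants and proportion open per state.
n_by_state    <- tapply(toy$open, toy$state, length)
prop_by_state <- tapply(toy$open, toy$state, mean)

open_prop <- data.frame(
  state     = names(n_by_state),
  n         = as.vector(n_by_state),
  prop_open = as.vector(prop_by_state)
)

# Flag states with too few restaurants for the proportion to be trusted.
min_n <- 2  # hypothetical cut-off
open_prop$small_sample <- open_prop$n < min_n
print(open_prop)
```

Here the only state at 100% is the one with a single recorded restaurant, which mirrors the outliers seen in the statebin plot.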
